Abstract. In this paper we address the problem of protecting elliptic
curve scalar multiplication implementations against side-channel analysis 
by using the atomicity principle. First of all we reexamine classical as- 
sumptions made by scalar multiplication designers and we point out that 
some of them are not relevant in the context of embedded devices. We 
then describe the state-of-the-art of atomic scalar multiplication and pro- 
pose an atomic pattern improvement method. Compared to the most ef- 
ficient atomic scalar multiplication published so far, our technique shows 
an average improvement of up to 10.6%. 

Keywords: Elliptic Curves, Scalar Multiplication, Atomicity, Side-Chan- 
nel Analysis. 



1 Introduction 
1.1 Preamble 

We consider the problem of performing scalar multiplication on elliptic curves
over $\mathbb{F}_p$ in the context of embedded devices such as smart cards. In
this context, efficiency and side-channel resistance are of utmost importance.
Regarding the first requirement, numerous studies of scalar multiplication
efficiency have produced efficient algorithms, including sliding-window and
signed-representation methods [19].

Regarding the second requirement, side-channel attacks exploit the fact that
physical leakages of a device (timing, power consumption, electromagnetic
radiation, etc.) depend on the operations performed and on the variables
manipulated. These attacks can be divided into two groups: the Simple
Side-Channel Analysis (SSCA) [25], which tries to observe a difference of
behavior depending on the value of the secret key by using a single
measurement, and the Differential Side-Channel Analysis (DSCA) [26], which
exploits data value leakages by performing statistical treatment over several
hundred measurements to retrieve information on the secret key. Since 1996,
many proposals have been made to protect scalar multiplication against these
attacks [7,12,23]. Amongst them, atomicity, introduced by Chevallier-Mames et
al. in [9], is one of the most interesting methods to counteract SSCA. This
countermeasure has been widely studied and Longa recently proposed an
improvement for some scalar multiplication algorithms [27].

In this paper we present a new atomicity implementation for scalar multi- 
plication, and we detail the atomicity improvement method we employed. This 
method can be applied to minimize atomicity implementation cost for sensitive 
algorithms with no security loss. In particular our method allows the implemen- 
tation of atomic scalar multiplication in embedded devices in a more efficient 
way than any of the previous methods. 

The rest of this paper is organized as follows. We finish this introduction 
by describing the scalar multiplication context which we are interested in and 
by mentioning an important observation on the cost of field additions. In Sec- 
tion 2 we recall some basics about Elliptic Curves Cryptography. In particular 
we present an efficient scalar multiplication algorithm introduced by Joye in 
2008 [21]. Then we recall in Section 3 the principle of atomicity and we draw 
up a comparative chart of the efficiency of atomic scalar multiplication algo- 
rithms before this work. In Section 4, we propose an improvement of the original 
atomicity principle. In particular, we show that our method, applied to Joye's 
scalar multiplication, allows a substantial gain of time compared to the original 
atomicity principle. Finally, Section 5 concludes this paper. 

1.2 Context of the Study 

We restrict the context of this paper to practical applications on embedded
devices, which imposes the constraint of using standardized curves over
$\mathbb{F}_p$.⁴ As far as we know, NIST curves [17] and Brainpool curves
[14,15] cover almost all curves currently used in the industry. We thus exclude
from our scope Montgomery curves [32], Hessian curves [20], and Edwards
curves⁵ [16], which cover neither NIST nor Brainpool curves.

Considering that embedded devices - in particular smart cards - have very
constrained resources (i.e. RAM and CPU), methods requiring heavy scalar
treatment are discarded as well. In particular it is impossible to store scalar
precomputations for some protocols such as ECDSA [1] where the scalar is
randomly generated before each scalar multiplication. Most of the recent
advances in this field thus cannot be taken into account: Double Base Number
System [13,31], multibase representation [28], Euclidean addition chains and
Zeckendorf representation [30].

⁴ The curves over $\mathbb{F}_p$ are generally recommended for practical
applications [33,34].

⁵ An elliptic curve over $\mathbb{F}_p$ is expressible in Edwards form only if
it has a point of order 4 [6] and is expressible in twisted Edwards form only
if it has three points of order 2 [4]. Since NIST and Brainpool curves have a
cofactor of 1 there is no such equivalence. Nevertheless, for each of these
curves, it is possible to find an extension field $\mathbb{F}_{p^q}$ over which
the curve has a point of order 4 and is thus birationally equivalent to an
Edwards curve. However the cost of a scalar multiplication over
$\mathbb{F}_{p^q}$ is prohibitive in the context of embedded devices.

1.3 On the Cost of Field Additions 

In the literature, the cost of additions and subtractions over $\mathbb{F}_p$
is generally neglected compared to the cost of field multiplication. While this
assumption is relevant in theory, we found that these operations are not as
insignificant as predicted on embedded devices. Smart cards for example have
crypto-coprocessors in order to perform multi-precision arithmetic. These
devices generally offer the following operations: addition, subtraction,
multiplication, modular multiplication and sometimes modular squaring. Modular
addition (respectively subtraction) must therefore be carried out by one
classical addition (resp. subtraction) and one conditional subtraction (resp.
addition) which should always be performed - i.e. the effective operation or a
dummy one - for SSCA immunity. Moreover every operation carried out by the
coprocessor requires a constant amount of extra software processing to
configure the coprocessor. As a result, the cost of field
additions/subtractions is not negligible compared to field multiplications.
Fig. 1 is an electromagnetic radiation measurement during the execution on a
smart card of a 192-bit modular multiplication followed by a modular addition.
Large amplitude blocks represent the 32-bit crypto-coprocessor activity while
those with smaller amplitude are only CPU processing. In this case the modular
addition takes approximately 0.3 times as long as the modular multiplication.
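The structure of such a modular addition can be sketched as follows. This is
only a minimal Python illustration under our own assumptions: the names
`mod_add` and `mod_sub` are hypothetical, operands are assumed to lie in
[0, p), and Python integers offer no timing guarantee. The point is that the
correcting operation is computed on every call and is then either kept (the
effective case) or discarded (the dummy case).

```python
def mod_add(x, y, p):
    """Return (x + y) mod p for operands 0 <= x, y < p."""
    s = x + y          # one classical (non-modular) addition
    t = s - p          # one subtraction, computed on every call
    # Keep t when the subtraction was effective, otherwise it was a dummy.
    return t if t >= 0 else s


def mod_sub(x, y, p):
    """Return (x - y) mod p for operands 0 <= x, y < p."""
    d = x - y          # one classical (non-modular) subtraction
    t = d + p          # one addition, computed on every call
    # Keep t when the addition was effective, otherwise it was a dummy.
    return t if d < 0 else d
```

Each call therefore costs two coprocessor operations, plus their
configuration overhead, which is why the cost of A cannot be neglected
against M.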




Fig. 1. Comparison between modular multiplication (M) and modular addition (A)
timings.

From experiments on different smart cards provided with an arithmetic
coprocessor, we estimated the average cost of modular additions/subtractions
compared to modular multiplications. Our results are presented in Table 1 where
A and M denote the cost of a field addition/subtraction and the cost of a field
multiplication respectively. We observe that the average value of A/M for the
considered bit lengths is about 0.2.



Bit length   160    192    224    256    320    384    512    521
A/M          0.36   0.30   0.25   0.22   0.16   0.13   0.09   0.09

Table 1. Measured A/M ratio on smart cards with crypto-coprocessor for NIST and
Brainpool ECC bit lengths.

Another useful field operation is negation in $\mathbb{F}_p$, i.e. the map
$x \mapsto -x$, which can be carried out by one non-modular subtraction
$p - x$. The cost N of this operation is therefore half the cost of a modular
addition/subtraction and thus we fix N/M = 0.5 A/M.

In the following sections we also consider the cost S of field squaring. The 
cost of a squaring compared to a multiplication depends on the functionalities 
of the corresponding crypto-coprocessor. When a dedicated squaring is available 
a commonly accepted value for S/M is 0.8 [8, 18] which is also corroborated by 
our experiments. Otherwise squarings must be carried out as multiplications and 
the ratio S/M is thus 1. 

2 Elliptic Curves 

In this section we recall some generalities about elliptic curves, and useful point 
representations. Then we present two efficient scalar multiplication algorithms. 

Cryptology makes use of elliptic curves over binary fields $\mathbb{F}_{2^n}$
and large characteristic prime fields $\mathbb{F}_p$. In this study we focus on
the latter case and hence assume $p > 3$.

2.1 Group Law Over F p 

An elliptic curve $E$ over $\mathbb{F}_p$, $p > 3$, can be defined as an
algebraic curve of affine Weierstrass equation:

$$E : y^2 = x^3 + ax + b \qquad (1)$$

where $a, b \in \mathbb{F}_p$ and $4a^3 + 27b^2 \not\equiv 0 \pmod{p}$.

The set of points of $E$ - i.e. the pairs $(x, y) \in \mathbb{F}_p^2$
satisfying (1) -, plus an extra point $O$ called point at infinity, form an
abelian group where $O$ is the neutral element. In the following, we present
the corresponding law depending on the selected point representation.

Affine Coordinates. Under the group law a point $P = (x_1, y_1)$ lying on the
elliptic curve $E$ admits an opposite $-P = (x_1, -y_1)$.

The sum of $P = (x_1, y_1)$ and $Q = (x_2, y_2)$, with $P, Q \neq O$ and
$P \neq \pm Q$, is the point $P + Q = (x_3, y_3)$ such that:

$$x_3 = \left(\frac{y_2 - y_1}{x_2 - x_1}\right)^2 - x_1 - x_2, \qquad
y_3 = (x_1 - x_3)\,\frac{y_2 - y_1}{x_2 - x_1} - y_1 \qquad (2)$$

The double of the point $P = (x_1, y_1)$, with $P \neq O$ and $y_1 \neq 0$, is
the point $2P = (x_2, y_2)$ as defined below, or $O$ if $y_1 = 0$.

$$x_2 = \left(\frac{3x_1^2 + a}{2y_1}\right)^2 - 2x_1, \qquad
y_2 = (x_1 - x_2)\,\frac{3x_1^2 + a}{2y_1} - y_1 \qquad (3)$$

Each point addition or point doubling requires an inversion in F p . This oper- 
ation can be very time consuming and leads developers on embedded devices to 
use other kinds of representations with which point operations involve no field 
inversion. In the following part of this section, we detail two of them. 
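The affine group law can be sketched in a few lines of Python. This is an
illustration only: the toy curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$
and the function names are our own assumptions, chosen so that results can be
checked by hand, and no special case other than $y_1 = 0$ is handled. Note the
call to `inv` in each operation, which is the costly field inversion.

```python
P_MOD, A_COEF = 17, 2   # assumed toy curve y^2 = x^3 + 2x + 2 over F_17
O = None                # point at infinity


def inv(v, p=P_MOD):
    """Field inversion, the expensive step of affine formulas (Python 3.8+)."""
    return pow(v % p, -1, p)


def neg(P, p=P_MOD):
    """Opposite of a point: (x, y) -> (x, -y)."""
    return O if P is O else (P[0], -P[1] % p)


def add(P, Q, p=P_MOD):
    """Affine addition; requires P, Q != O and P != +/-Q."""
    (x1, y1), (x2, y2) = P, Q
    lam = (y2 - y1) * inv(x2 - x1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)


def double(P, a=A_COEF, p=P_MOD):
    """Affine doubling; returns O when y1 = 0."""
    x1, y1 = P
    if y1 == 0:
        return O
    lam = (3 * x1 * x1 + a) * inv(2 * y1, p) % p
    x2 = (lam * lam - 2 * x1) % p
    y2 = (lam * (x1 - x2) - y1) % p
    return (x2, y2)
```

For instance, with P = (5, 1) on this toy curve, `double(P)` gives (6, 3) and
`add(P, double(P))` gives (10, 6).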

Jacobian Projective Coordinates. By denoting $x = X/Z^2$ and $y = Y/Z^3$,
$Z \neq 0$, we obtain the Jacobian projective Weierstrass equation of the
elliptic curve $E$:

$$Y^2 = X^3 + aXZ^4 + bZ^6, \qquad (4)$$

where $a, b \in \mathbb{F}_p$ and $4a^3 + 27b^2 \neq 0$. Each point
$P = (x, y)$ can be represented by its Jacobian projective coordinates
$(q^2 x : q^3 y : q)$ with $q \in \mathbb{F}_p^*$. Conversely, every point
$P = (X : Y : Z)$ different from $O$ can be represented in affine coordinates
by $(x, y) = (X/Z^2, Y/Z^3)$.

The opposite of a point $(X : Y : Z)$ is $(X : -Y : Z)$ and the point at
infinity $O$ is denoted by the unique point with $Z = 0$, $O = (1 : 1 : 0)$.

The sum of $P = (X_1 : Y_1 : Z_1)$ and $Q = (X_2 : Y_2 : Z_2)$, with
$P, Q \neq O$ and $P \neq \pm Q$, is the point $P + Q = (X_3 : Y_3 : Z_3)$
such that:

$$\begin{cases}
X_3 = F^2 - E^3 - 2AE^2 \\
Y_3 = F(AE^2 - X_3) - CE^3 \\
Z_3 = Z_1 Z_2 E
\end{cases}
\quad\text{with}\quad
\begin{aligned}
A &= X_1 Z_2^2, & B &= X_2 Z_1^2, \\
C &= Y_1 Z_2^3, & D &= Y_2 Z_1^3, \\
E &= B - A, & F &= D - C
\end{aligned}
\qquad (5)$$

If $P$ is given in affine coordinates - i.e. $Z_1 = 1$ - it is possible to save
one field squaring and four multiplications in (5). Such a case is referred to
as mixed affine-Jacobian addition. On the other hand if $P$ has to be added
several times, storing $Z_1^2$ and $Z_1^3$ saves one squaring and one
multiplication in all following additions involving $P$. This latter case is
referred to as readdition.

The double of the point $P = (X_1 : Y_1 : Z_1)$ is the point
$2P = (X_2 : Y_2 : Z_2)$ such that:

$$\begin{cases}
X_2 = C^2 - 2B \\
Y_2 = C(B - X_2) - 2A^2 \\
Z_2 = 2Y_1 Z_1
\end{cases}
\quad\text{with}\quad
\begin{aligned}
A &= 2Y_1^2 \\
B &= 2AX_1 \\
C &= 3X_1^2 + aZ_1^4
\end{aligned}
\qquad (6)$$

When curve parameter $a$ is $-3$, doubling can be carried out taking
$C = 3(X_1 + Z_1^2)(X_1 - Z_1^2)$, which saves two squarings in (6). We denote
this operation by fast doubling.

Adding up field operations yields 12M + 4S + 7A for general addition,
11M + 3S + 7A for readdition, 8M + 3S + 7A for mixed addition, 4M + 6S + 11A
for the general doubling formula and 4M + 4S + 12A for fast doubling.
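As an illustration, the Jacobian addition (5) and general doubling (6) can be
written as the following inversion-free Python sketch. The toy curve
$y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ and all function names are
assumptions made for this example; special cases ($P = \pm Q$, $O$) are not
handled, and the code makes no attempt to minimize or count field operations.

```python
P_MOD, A_COEF = 17, 2   # assumed toy curve y^2 = x^3 + 2x + 2 over F_17


def jac_add(P, Q, p=P_MOD):
    """Jacobian addition, formula (5); requires P, Q != O and P != +/-Q."""
    X1, Y1, Z1 = P
    X2, Y2, Z2 = Q
    Z1s, Z2s = Z1 * Z1 % p, Z2 * Z2 % p
    A = X1 * Z2s % p
    B = X2 * Z1s % p
    C = Y1 * Z2s * Z2 % p
    D = Y2 * Z1s * Z1 % p
    E = (B - A) % p
    F = (D - C) % p
    E2 = E * E % p
    E3 = E2 * E % p
    X3 = (F * F - E3 - 2 * A * E2) % p
    Y3 = (F * (A * E2 - X3) - C * E3) % p
    Z3 = Z1 * Z2 * E % p
    return (X3, Y3, Z3)


def jac_double(P, a=A_COEF, p=P_MOD):
    """General Jacobian doubling, formula (6)."""
    X1, Y1, Z1 = P
    A = 2 * Y1 * Y1 % p
    B = 2 * A * X1 % p
    C = (3 * X1 * X1 + a * pow(Z1, 4, p)) % p
    X2 = (C * C - 2 * B) % p
    Y2 = (C * (B - X2) - 2 * A * A) % p
    Z2 = 2 * Y1 * Z1 % p
    return (X2, Y2, Z2)


def to_affine(P, p=P_MOD):
    """Convert (X : Y : Z), Z != 0, to (X/Z^2, Y/Z^3) with a single inversion."""
    X, Y, Z = P
    zi = pow(Z, -1, p)
    return (X * zi * zi % p, Y * zi * zi * zi % p)
```

The field inversion now appears only once, in the final conversion back to
affine coordinates, instead of once per point operation.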



Modified Jacobian Projective Coordinates. This representation, introduced in
[11], is derived from the Jacobian projective representation to which a fourth
coordinate is added for computation convenience. In this representation, a
point on the curve $E$ is thus represented by $(X : Y : Z : aZ^4)$, where
$(X : Y : Z)$ stands for the Jacobian representation.

Modified Jacobian projective coordinates provide a particularly efficient
doubling formula. Indeed, the double of a point $P = (X_1 : Y_1 : Z_1 : W_1)$
is given by $2P = (X_2 : Y_2 : Z_2 : W_2)$ such that:

$$\begin{cases}
X_2 = A^2 - 2C \\
Y_2 = A(C - X_2) - D \\
Z_2 = 2Y_1 Z_1 \\
W_2 = 2DW_1
\end{cases}
\quad\text{with}\quad
\begin{aligned}
A &= 3X_1^2 + W_1 \\
B &= 2Y_1^2 \\
C &= 2BX_1 \\
D &= 2B^2
\end{aligned}
\qquad (7)$$

Doubling hence requires only 4M + 4S + 12A for all values of $a$. On the other
hand, addition is less efficient compared to the Jacobian projective
representation: by applying formula (5), we need to compute the fourth
coordinate which is required in point doubling, adding an overhead of
1M + 2S [21].



On S-M Trade-Offs. Addition and doubling formulas presented above are
deliberately not state-of-the-art, see [5]. Indeed, recent advances have
provided Jacobian formulas where some field multiplications have been traded
for faster field squarings [27, Sec. 4.1]. These advances have been achieved by
using the so-called S-M trade-off principle, which is based on the fact that
computing $ab$ when $a^2$ and $b^2$ are known can be done as
$2ab = (a + b)^2 - a^2 - b^2$. This allows a squaring to replace a
multiplication since the additional factor 2 can be handled by considering the
representative of the Jacobian coordinates equivalence class
$(X : Y : Z) = (2^2 X : 2^3 Y : 2Z)$.

Nevertheless such trade-offs not only replace field multiplications by field
squarings but also add field additions. In the previous example at least 3
extra additions have to be performed, thus taking S/M = 0.8 implies that the
trade-off is profitable only if A/M < 0.067, which is never the case with the
devices considered using standardized curves, as seen in Section 1.3. These new
formulas are thus not relevant in the context of embedded devices.
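The break-even point quoted above can be checked with a short computation. The
sketch below is illustrative only and its function names are hypothetical. It
evaluates the identity $2ab = (a + b)^2 - a^2 - b^2$ modulo $p$ and solves
S + 3A < M for A/M, using exact fractions to avoid rounding noise.

```python
from fractions import Fraction


def double_product_via_squaring(a, b, a_sq, b_sq, p):
    """Return 2ab mod p from known a^2 and b^2: 1 squaring, 3 additions/subtractions."""
    s = (a + b) % p                    # first extra addition
    return (s * s - a_sq - b_sq) % p   # one squaring and two subtractions


def break_even_ratio(s_over_m, extra_additions=3):
    """Largest A/M below which trading one M for one S plus additions pays off."""
    return (1 - Fraction(s_over_m)) / extra_additions


threshold = break_even_ratio(Fraction(4, 5))   # S/M = 0.8
smallest_measured = Fraction(9, 100)           # smallest A/M in Table 1 (0.09)
trade_off_profitable = smallest_measured < threshold
```

With S/M = 0.8 the threshold is 1/15, i.e. about 0.067, which lies below every
A/M value measured in Table 1, so `trade_off_profitable` is `False`.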

2.2 Scalar Multiplication 

Generalities. The operation consisting in calculating the multiple of a point
$k \cdot P = P + P + \cdots + P$ ($k$ times) is called scalar multiplication
and the integer $k$ is thus referred to as the scalar.

Scalar multiplication is used in ECDSA signature [1] and ECDH key agree- 
ment [2] protocols. Implementing such protocols on embedded devices requires 
particular care from both the efficiency and the security points of view. In- 
deed scalar multiplication turns out to be the most time consuming part of the 
aforementioned protocols, and since it uses secret values as scalars, side-channel 
analysis endangers the security of those protocols. 

Most of the scalar multiplication algorithms published so far are derived from 
the traditional double and add algorithm. This algorithm can scan the binary 
representation of the scalar in both directions which leads to the left-to-right 
and right-to-left variants. The former is generally preferred over the latter since 
it saves one point in memory. 

Moreover, since computing the opposite of a point P on an elliptic curve is
virtually free, the most efficient methods for scalar multiplication use signed
digit representations such as the Non-Adjacent Form (NAF) [3]. Under the NAF
representation, an n-bit scalar has an average Hamming weight of n/3, which
implies that one point doubling is performed for every bit of the scalar and,
on average, one point addition is performed every three bits.
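For readers unfamiliar with the NAF, the following Python sketch shows one
common way to compute it. The function names are ours and the code is meant as
an illustration rather than as the recoding used in any particular
implementation. Digits are produced from the least significant end, and an odd
residue is turned into +1 or -1 so that the next digit is always zero.

```python
def naf(k):
    """Return the NAF digits of k >= 0, least significant digit first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)   # +1 if k = 1 mod 4, -1 if k = 3 mod 4
            k -= d            # k is now divisible by 4
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits


def nonzero_density(bits=12):
    """Fraction of non-zero NAF digits over all scalars below 2**bits."""
    nonzero = total = 0
    for k in range(1, 2 ** bits):
        ds = naf(k)
        nonzero += sum(1 for d in ds if d)
        total += len(ds)
    return nonzero / total
```

For example `naf(7)` returns `[-1, 0, 0, 1]`, i.e. 7 = 8 - 1, which needs one
addition instead of the two required by the binary expansion 111.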

In the next two subsections, we present a left-to-right and a right-to-left
NAF scalar multiplication algorithm.

Left To Right Binary NAF Scalar Multiplication. Alg. 1 presents the
classical NAF scalar multiplication algorithm.

Algorithm 1 Left-to-right binary NAF scalar multiplication [19]

Inputs : P = (X_1 : Y_1 : Z_1) ∈ E(F_p), k = (k_{l-1} ... k_1 k_0)_NAF
Output : k · P

1. (X_2 : Y_2 : Z_2) ← (X_1 : Y_1 : Z_1)
2. i ← l − 2
3. while i ≥ 0 do
       (X_2 : Y_2 : Z_2) ← 2 · (X_2 : Y_2 : Z_2)
       if k_i = 1 then
           (X_2 : Y_2 : Z_2) ← (X_2 : Y_2 : Z_2) + (X_1 : Y_1 : Z_1)
       if k_i = −1 then
           (X_2 : Y_2 : Z_2) ← (X_2 : Y_2 : Z_2) − (X_1 : Y_1 : Z_1)
       i ← i − 1
4. return (X_2 : Y_2 : Z_2)
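The control flow of Alg. 1 can be sketched in Python as follows. To keep the
example self-contained, the group operations are passed in as callables
(`double`, `add`, `neg`); this interface and the function names are our own
assumptions, not part of [19]. Any group can be plugged in, for instance
Jacobian point operations or, for a quick check, plain integers under addition.

```python
def naf_msb_first(k):
    """NAF digits of k >= 1, most significant digit first."""
    digits = []
    while k > 0:
        d = 2 - (k % 4) if k & 1 else 0
        k = (k - d) >> 1
        digits.append(d)
    return digits[::-1]


def scalar_mult_l2r(k, P, double, add, neg):
    """Left-to-right binary NAF scalar multiplication (structure of Alg. 1)."""
    digits = naf_msb_first(k)
    Q = P                       # the leading NAF digit is always 1
    for d in digits[1:]:        # i = l - 2 down to 0
        Q = double(Q)           # one doubling per digit
        if d == 1:
            Q = add(Q, P)
        elif d == -1:
            Q = add(Q, neg(P))  # subtraction: add the opposite point
    return Q


# Stand-in group for a quick check: integers under addition, so k * P is a product.
int_ops = dict(double=lambda x: 2 * x, add=lambda x, y: x + y, neg=lambda x: -x)
result = scalar_mult_l2r(27, 5, **int_ops)
```

Here `result` equals 27 * 5. The branches taken inside the loop depend
directly on the digits of k, which is exactly the behavior that SSCA exploits
when the implementation is left unprotected.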
Point doubling can be done in Alg. 1 using the general Jacobian doubling
formula or the fast doubling formula. Since NIST curves fulfill a = −3 and each
Brainpool curve is provided with an isomorphism to a curve with a = −3, we
assume that fast doubling is always possible. Point addition can be performed
using the mixed addition formula if input points are given in affine
coordinates or by using the readdition formula otherwise.

It is possible to reduce the number of point additions by using window
techniques⁶ which need the precomputation of some first odd multiples of the
point P. Table 2 recalls the number of point additions per bit of scalar when
having from 0 (simple NAF) to 4 precomputed points. Using more than 4 points
gives even better results but does not seem practical in the context of
constrained memory.



Nb. of precomp. points    0            1            2            3             4
Precomputed points        -            3P           3P, 5P       3P, 5P, 7P    3P, ..., 9P
Point additions / bit     1/3 ≈ 0.33   1/4 = 0.25   2/9 ≈ 0.22   1/5 = 0.20    4/21 ≈ 0.19

Table 2. Average number of point additions per bit of scalar using window NAF
algorithms.



Right To Left Binary NAF Mixed Coordinates Multiplication. We recall here a
very efficient algorithm performing right-to-left NAF scalar multiplication.
Indeed this algorithm uses the fast modified Jacobian doubling formula, which
works for all curves - i.e. for all a - without needing the slow modified
Jacobian addition.

This is achieved by reusing the idea of mixed coordinates scalar multiplication
(i.e. two coordinate systems are used simultaneously) introduced by Cohen,
Ono and Miyaji in [11]. The aim of this approach is to make the best use of
two coordinate systems by processing some operations with one system and
others with the second. Joye proposed in [21] to perform additions by using
Jacobian coordinates, doublings - referred to as * - by using modified Jacobian
coordinates, and to compute the NAF representation of the scalar on-the-fly,
cf. Alg. 2.⁷

In the same way as their left-to-right counterparts benefit from precomputed
points, right-to-left algorithms can be enhanced using window techniques if
extra memory is available [22,35]. In this case precomputations are replaced by
postcomputations, the cost of which is negligible for the considered window
sizes and bit lengths.

In [21] the author suggests protecting Alg. 2 against SSCA by using the
so-called atomicity principle. We recall in the next section the principle of
this SSCA countermeasure.

⁶ By window techniques we mean the sliding window NAF and the window NAF_w
algorithms, see [19] for more details.

⁷ In Alg. 2, Jacobian addition is assumed to handle the special cases P = ±Q,
P = O, Q = O as discussed in [21].

Algorithm 2 Right-to-left binary NAF mixed coordinates multiplication [21]

Inputs : P = (X_1 : Y_1 : Z_1) ∈ E(F_p), k
Output : k · P

1. (X_2 : Y_2 : Z_2) ← (1 : 1 : 0)
2. (R_1 : R_2 : R_3 : R_4) ← (X_1 : Y_1 : Z_1 : aZ_1^4)
3. while k > 1 do
       if k ≡ 1 mod 2 then
           u ← 2 − (k mod 4)
           k ← k − u
           if u = 1 then
               (X_2 : Y_2 : Z_2) ← (X_2 : Y_2 : Z_2) + (R_1 : R_2 : R_3)
           else
               (X_2 : Y_2 : Z_2) ← (X_2 : Y_2 : Z_2) + (R_1 : −R_2 : R_3)
       k ← k/2
       (R_1 : R_2 : R_3 : R_4) ← 2 * (R_1 : R_2 : R_3 : R_4)
4. (X_2 : Y_2 : Z_2) ← (X_2 : Y_2 : Z_2) + (R_1 : R_2 : R_3)
5. return (X_2 : Y_2 : Z_2)
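The structure of Alg. 2 can likewise be sketched in Python. As an assumption
made for illustration, the group operations and the neutral element are passed
in as parameters; in the setting of [21], `add` would be a Jacobian addition
handling the special cases and `double` a modified Jacobian doubling. The NAF
digit u is derived on the fly from k mod 4, so no recoded scalar is stored.

```python
def scalar_mult_r2l(k, P, double, add, neg, identity):
    """Right-to-left binary NAF scalar multiplication (structure of Alg. 2), k >= 1."""
    acc = identity              # accumulator, starts at the point at infinity
    R = P                       # running power-of-two multiple of P
    while k > 1:
        if k % 2 == 1:
            u = 2 - (k % 4)     # NAF digit: +1 or -1
            k -= u
            if u == 1:
                acc = add(acc, R)
            else:
                acc = add(acc, neg(R))
        k //= 2
        R = double(R)           # one doubling per scanned bit
    return add(acc, R)          # the leading digit is always 1


# Stand-in group for a quick check: integers under addition.
int_ops = dict(double=lambda x: 2 * x, add=lambda x, y: x + y,
               neg=lambda x: -x, identity=0)
result = scalar_mult_r2l(7, 3, **int_ops)
```

For k = 7 the loop subtracts P once and then adds 8P in the final step, so
`result` equals 21. Doublings act only on R and additions only on the
accumulator, which is what allows two different coordinate systems to be used.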



3 Atomicity 

In this section we recall the principle of atomicity and its application to scalar 
multiplication. Other countermeasures exist in order to thwart SSCA such as 
regular algorithms [12,22,24] and unified formulas [7,16]. However regular
algorithms require costly extra curve operations, and unified formulas for
Weierstrass curves over $\mathbb{F}_p$ - only known in the affine and
homogeneous coordinate systems, see [7] - are also very costly. Therefore
atomicity turns out to be more efficient
in the context of embedded devices. It is thus natural to compare the efficiency 
of the two scalar multiplication methods presented in Section 2.2 protected by 
atomicity. 

We recall in the following how atomicity is generally implemented on elliptic 
curves cryptography, for a complete atomicity principle description see [9]. 

3.1 State-of-the-Art 

The atomicity principle was introduced in [10]. This countermeasure consists
in rewriting all the operations carried out through an algorithm into a
sequence of identical atomic patterns. The purpose of this method is to defeat
SSCA since an attacker has nothing to learn from a uniform succession of
identical patterns.

In the case of scalar multiplications, a succession of point doublings and 
point additions is performed. Each of these operations being composed of field 
operations, the execution of a scalar multiplication can be seen as a succession of 
field operations. The atomicity consists here in rewriting the succession of field 
operations into a sequence of identical atomic patterns. The atomic pattern (1)
proposed in [9] is composed of the following field operations: a multiplication,
two additions and a negation. The R_i's denote the crypto-coprocessor registers.

    R_1 ← R_2 · R_3
    R_4 ← R_5 + R_6
    R_7 ← −R_8                    (1)
    R_9 ← R_10 + R_11

This choice relies on the observation that during the execution of point additions 
and point doublings, no more than two additions and one negation are required 
between two multiplications. Atomicity consists then of writing point addition 
and point doubling as sequences of this pattern - as many as there are field 
multiplications (including squarings). 
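To make the idea concrete, the sketch below pads a sequence of field operations
into repetitions of pattern (1). It is a simplified illustration under our own
assumptions and not the decomposition procedure of [9]: operations are given as
the letters M, S, A and N, they are consumed strictly in order by a greedy
packer (hypothetical function `atomize`), and every unused slot is filled with
a dummy operation marked `*`.

```python
PATTERN_1 = ("M", "A", "N", "A")   # multiplication, addition, negation, addition


def atomize(ops, pattern=PATTERN_1):
    """Pack ops, in order, into repetitions of pattern; '*' marks a dummy slot."""
    ops = ["M" if op == "S" else op for op in ops]   # squarings run as multiplications
    for op in ops:
        if op not in pattern:
            raise ValueError("operation %r does not fit the pattern" % op)
    blocks, i = [], 0
    while i < len(ops):
        block = []
        for slot in pattern:
            if i < len(ops) and ops[i] == slot:
                block.append(slot)     # effective operation
                i += 1
            else:
                block.append("*")      # dummy operation
        blocks.append(block)
    return blocks


blocks = atomize(["S", "A", "M", "N", "A", "M"])
```

On this input the packer emits three blocks, one per multiplication or
squaring, and an observer sees the same M, A, N, A shape three times whatever
the underlying operations were.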

Therefore this countermeasure induces two kinds of costs:

— Field squarings have to be performed as field multiplications. Then this 
approach is costly on embedded devices with dedicated hardware offering 
modular squaring operation, i.e. when S/M < 1. 

— Dummy additions and negations are added. Their cost is generally negligible 
from a theoretical point of view but, as shown in Section 1.3, the cost of such 
operations must be taken into account in the context of embedded devices. 

To reduce these costs, Longa proposed in his PhD thesis [27, Chap. 5] the
two following atomic patterns in the context of Jacobian coordinates:

         R_1  ← R_2 · R_3                   R_1  ← R_2^2
         R_4  ← −R_5                        R_3  ← −R_4
         R_6  ← R_7 + R_8                   R_5  ← R_6 + R_7
    (2)  R_9  ← R_10 · R_11            (3)  R_8  ← R_9 · R_10
         R_12 ← −R_13                       R_11 ← −R_12
         R_14 ← R_15 + R_16                 R_13 ← R_14 + R_15
         R_17 ← R_18 + R_19                 R_16 ← R_17 + R_18

Compared with atomic pattern (1), these two patterns slightly reduce the 
number of field additions (gain of one addition every two multiplications). More- 
over, atomic pattern (3) takes advantage of the squaring operation by replacing 
one multiplication out of two by a squaring. 

In [27, Appendices] Longa expresses the mixed affine-Jacobian addition formula
as 6 atomic patterns (2) or (3) and the fast doubling formula as 4 atomic
patterns (2) or (3). This allows an efficient left-to-right scalar
multiplication using fast doubling and mixed affine-Jacobian addition protected
with atomic patterns (2) or (3).

3.2 Atomic Left-to-Right Scalar Multiplication 

We detail in the following why Longa's left-to-right scalar multiplication
using fast doubling and mixed affine-Jacobian addition is not compatible with
our security constraints.

Defeating DSCA⁸ requires the randomization of input point coordinates.
This can be achieved by two means: projective coordinates randomization [12]
and random curve isomorphism [23]. The first one allows the use of the fast
point doubling formula but prevents the use of mixed additions since input
points P, 3P, ... have their Z coordinate randomized. On the other hand the
random curve isomorphism keeps input points in affine coordinates but
randomizes a, which thus imposes the use of the general doubling formula
instead of the fast one.

Since Longa investigated neither general doubling nor readdition, we present in
Appendix A.1 the formulas to perform the former by using 5 atomic patterns (2)
or (3) and in Appendix A.2 the formulas to perform the latter by using 7 atomic
patterns (2). It seems very unlikely that one can express readdition using
atomic pattern (3): since the state-of-the-art readdition formula using the S-M
trade-off requires 10 multiplications and 4 squarings, 3 other multiplications
would have to be traded for squarings.

Therefore secure left-to-right scalar multiplication can be achieved either by
using atomic pattern (2) and projective coordinates randomization, which would
involve fast doublings and readditions, or by using atomic pattern (3) and
random curve isomorphism, which would involve general doublings and mixed
additions.

3.3 Atomic Right-to-Left Mixed Scalar Multiplication 

As suggested in [21] we protected Alg. 2 with atomicity. Since Longa's atomic
patterns have not been designed for modified Jacobian doubling, we applied
atomic pattern (1) to protect Alg. 2.

The decomposition of the general Jacobian addition formula in 16 atomic
patterns (1) is given in [9]. Since we have not found it in the literature, we
present in Appendix A.3 a decomposition of the modified Jacobian doubling
formula in 8 atomic patterns (1).

Projective coordinates randomization and random curve isomorphism coun- 
termeasures can both be applied to this solution. 

3.4 Atomic Scalar Multiplication Algorithms Comparison 

We compare in Table 3 the three previously proposed atomically protected
algorithms. As discussed in Section 1.3 we fix A/M = 0.2 and N/M = 0.1. Costs
are given as the average number of field multiplications per bit of scalar.
Each cost is estimated for devices providing dedicated modular squaring - i.e.
S/M = 0.8 - or not - i.e. S/M = 1. If extra memory is available,
precomputations or postcomputations are respectively used to speed up
left-to-right and right-to-left scalar multiplications. The
pre/postcomputation cost is here not taken into account but is constant for
every row of the chart.

⁸ We include in DSCA the Template Attack on ECDSA from [29].

Nb. of extra   S/M    Left-to-right   Left-to-right   Right-to-left
points                with (2)        with (3)        with (1)
0              0.8    17.7            18.2            20.0
               1      17.7            19.6            20.0
1              0.8    16.1            16.9            18.0
               1      16.1            18.2            18.0
2              0.8    15.6            16.5            17.3
               1      15.6            17.7            17.3
3              0.8    15.1            16.1            16.8
               1      15.1            17.4            16.8
4              0.8    14.9            16.0            16.6
               1      14.9            17.2            16.6

Table 3. Cost estimation in field multiplications per bit of the 3 atomically
protected scalar multiplication algorithms with A/M = 0.2.



It appears that in our context atomic left-to-right scalar multiplication using 
atomic pattern (2) with fast doubling and readditions is the fastest solution and 
is, on average for the 10 rows of Table 3, 10.5% faster than atomic right-to-left 
mixed scalar multiplication using atomic pattern (1). 
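The entries of Table 3 can be recomputed from the quantities given in the text,
which makes the comparison easy to re-run with other ratios. The Python sketch
below is our own illustration (function and variable names are hypothetical).
It assumes that pattern (1) costs M + 2A + N, pattern (2) costs 2M + 2N + 3A
and pattern (3) costs S + M + 2N + 3A, and it uses the pattern counts stated
above together with the additions-per-bit rates of Table 2.

```python
from fractions import Fraction as Fr

A, N = Fr(1, 5), Fr(1, 10)     # A/M = 0.2 and N/M = 0.1, as fixed in the text
ADDS_PER_BIT = [Fr(1, 3), Fr(1, 4), Fr(2, 9), Fr(1, 5), Fr(4, 21)]  # Table 2


def pattern_cost(pattern, s_over_m):
    """Cost of one atomic pattern, in field multiplications."""
    if pattern == 1:
        return 1 + 2 * A + N                 # M, A, N, A
    if pattern == 2:
        return 2 + 2 * N + 3 * A             # M, N, A, M, N, A, A
    if pattern == 3:
        return s_over_m + 1 + 2 * N + 3 * A  # S, N, A, M, N, A, A
    raise ValueError("unknown pattern")


def cost_per_bit(pattern, dbl_patterns, add_patterns, s_over_m, extra_points):
    """One doubling per bit plus a fraction of an addition per bit."""
    r = ADDS_PER_BIT[extra_points]
    return pattern_cost(pattern, s_over_m) * (dbl_patterns + r * add_patterns)


def table3_row(extra_points, s_over_m):
    """(left-to-right with (2), left-to-right with (3), right-to-left with (1))."""
    return (
        cost_per_bit(2, 4, 7, s_over_m, extra_points),   # fast doubling, readdition
        cost_per_bit(3, 5, 6, s_over_m, extra_points),   # general doubling, mixed addition
        cost_per_bit(1, 8, 16, s_over_m, extra_points),  # modified doubling, Jacobian addition
    )


row = [round(float(c), 1) for c in table3_row(0, Fr(4, 5))]
```

Under these assumptions `row` is `[17.7, 18.2, 20.0]`, the first line of
Table 3, and the other rows are obtained the same way.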

In the next section we present our contribution that aims at minimizing the 
atomicity cost by optimizing the atomic pattern. Then we apply it on the right- 
to-left mixed scalar multiplication algorithm since efficient patterns are already 
known for the two left-to-right variants. 

4 Atomic Pattern Improvement 

We propose here a twofold atomicity improvement method: firstly, we take ad- 
vantage of the fact that a squaring can be faster than a multiplication. Secondly, 
we reduce the number of additions and negations used in atomic patterns in 
order to increase the efficiency of scalar multiplication. 

4.1 First Part: Atomic Pattern Extension 

As explained previously, our first idea is to reduce the efficiency loss due to field 
squarings turned into multiplications. 

Method Presentation. Let $O_1$ and $O_2$ be two atomically written operations
(point addition and doubling in our case) requiring $m$ and $n$ atomic
patterns respectively. Let us assume that a sub-operation $o_1$ from the atomic
pattern (field multiplication in our case) could sometimes be replaced by
another preferred sub-operation $o_2$ (such as field squaring). Let us finally
assume that $O_1$ requires at least $m'$ sub-operations $o_1$ (along with
$m - m'$ sub-operations $o_2$) and $O_2$ requires at least $n'$ sub-operations
$o_1$ (along with $n - n'$ sub-operations $o_2$).

Then, if $d = \gcd(m, n) > 1$, let $e$ be the greatest non-negative integer
satisfying:

$$e \cdot \frac{m}{d} \leq m - m' \quad\text{and}\quad
e \cdot \frac{n}{d} \leq n - n'. \qquad (8)$$

Since 0 is obviously a solution, it is certain that $e$ is defined. If $e > 0$
we can now apply the following method. Let a new pattern be defined with
$d - e$ original atomic patterns followed by $e$ atomic patterns with $o_2$
replacing $o_1$ - the order can be modified at convenience.

It is now possible to express operations $O_1$ and $O_2$ with $m/d$ and $n/d$
new patterns respectively. Using the new pattern in $O_1$ (resp. $O_2$) instead
of the old one allows replacing $e \cdot m/d$ (resp. $e \cdot n/d$)
sub-operations $o_1$ by $o_2$.

Application to Mixed Coordinates Scalar Multiplication. Applying this
method to Alg. 2 yields the following result: $O_1$ being the Jacobian
projective addition, $O_2$ the modified Jacobian projective doubling, $o_1$ the
field multiplication and $o_2$ the field squaring, then $m = 16$, $m' = 11$,
$n = 8$, $n' = 3$, $d = 8$ and $e = 2$. Therefore we define a new temporary
atomic pattern composed of 8 patterns (1) where 2 multiplications are replaced
by squarings. We thus have one fourth of the field multiplications carried out
as field squarings. This extended pattern would have to be repeated twice for
an addition and once for a doubling.
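The parameters d and e of the extension method follow mechanically from m, m',
n and n'. The short Python sketch below (with a hypothetical function name)
computes them by taking the largest integer e that satisfies both constraints
of the method, each constraint being solved for e by integer division.

```python
from math import gcd


def extension_parameters(m, m_min, n, n_min):
    """Return (d, e): d = gcd(m, n) and the largest e allowed by both constraints.

    m, n         : number of atomic patterns of the two operations
    m_min, n_min : minimal number of sub-operations o1 each one needs (m', n')
    """
    d = gcd(m, n)
    e_first = (m - m_min) * d // m    # largest e with e * m/d <= m - m'
    e_second = (n - n_min) * d // n   # largest e with e * n/d <= n - n'
    return d, min(e_first, e_second)


# Jacobian addition (16 patterns, at least 11 M) and
# modified Jacobian doubling (8 patterns, at least 3 M).
d, e = extension_parameters(16, 11, 8, 3)
```

This gives d = 8 and e = 2: the extended pattern groups 8 original patterns,
2 of which carry a squaring instead of a multiplication. The method is only
useful when d > 1 and e > 0.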

We applied this new approach in Fig. 2 where atomic general Jacobian ad- 
dition and modified Jacobian doubling are rewritten in order to take advantage 
of the squarings. We denote by * the dummy field additions and negations that 
must be added to complete atomic patterns. 

4.2 Second Part: Atomic Pattern Cleaning-Up 

In a second step we aim at reducing the number of dummy field operations. In
Fig. 2, we marked the operations that are never used in Add. 1, Add. 2
and Dbl. These field operations may then be removed, saving 5 field additions
and 3 field negations per pattern occurrence.

However, we found out that field operations could be rearranged in order 
to maximize the number of rows over the three columns composed of dummy 
operations only. We then merge negations and additions into subtractions when 
possible. This improvement is depicted in Fig. 3.

[Figure: register-level operation listings in three columns, Add. 1, Add. 2 and Dbl.]

Fig. 2. Extended atomic pattern applied to Jacobian projective addition and modified
Jacobian projective doubling.






[Figure 3: table of field operations in three columns, Add. 1, Add. 2 and Dbl.]

Fig. 3. Improved arrangement of field operations in extended atomic pattern from
Fig. 2.
This final optimization now allows us to save 6 field additions and to
remove the 8 field negations per pattern occurrence. One may note that no
dummy operation remains in modified Jacobian doubling. We thus believe that
our resulting atomic pattern (4) is optimal for this operation:



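\[
\begin{array}{l}
R_1 \leftarrow R_2^2 \\
R_3 \leftarrow R_4 + R_5 \\
R_6 \leftarrow R_7 \cdot R_8 \\
R_9 \leftarrow R_{10} + R_{11} \\
R_{12} \leftarrow R_{13} \cdot R_{14} \\
R_{15} \leftarrow R_{16} + R_{17} \\
R_{18} \leftarrow R_{19} \cdot R_{20} \\
R_{21} \leftarrow R_{22} + R_{23} \\
R_{24} \leftarrow R_{25} + R_{26} \\
R_{27} \leftarrow R_{28}^2 \\
R_{29} \leftarrow R_{30} \cdot R_{31} \\
R_{32} \leftarrow R_{33} + R_{34} \\
R_{35} \leftarrow R_{36} - R_{37} \\
R_{38} \leftarrow R_{39} \cdot R_{40} \\
R_{41} \leftarrow R_{42} - R_{43} \\
R_{44} \leftarrow R_{45} - R_{46} \\
R_{47} \leftarrow R_{48} \cdot R_{49} \\
R_{50} \leftarrow R_{51} - R_{52}
\end{array}
\tag{4}
\]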
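
To make the shape of pattern (4) concrete, the following Python sketch fills its 18 rows with a modified Jacobian doubling and checks the result against the affine doubling formula on a toy prime field. The operand assignment is one way of filling the pattern, written down here as an assumption rather than as the exact register allocation of the figures; the class `Field`, the helper names and the toy parameters are illustrative.

```python
# Operation kinds of pattern (4), row by row:
# S = squaring, M = multiplication, A = addition, B = subtraction.
PATTERN_4 = "SAMAMAMAASMABMBBMB"

class Field:
    """Arithmetic modulo p that records the kind of every operation performed."""
    def __init__(self, p):
        self.p, self.trace = p, []
    def sqr(self, x):
        self.trace.append("S"); return x * x % self.p
    def mul(self, x, y):
        self.trace.append("M"); return x * y % self.p
    def add(self, x, y):
        self.trace.append("A"); return (x + y) % self.p
    def sub(self, x, y):
        self.trace.append("B"); return (x - y) % self.p

def double_modified_jacobian(f, X1, Y1, Z1, W1):
    """Double (X1 : Y1 : Z1) with W1 = a*Z1^4, one operation per pattern row."""
    R1 = f.sqr(X1)
    R2 = f.add(Y1, Y1)
    Z2 = f.mul(R2, Z1)
    R4 = f.add(R1, R1)
    R3 = f.mul(R2, Y1)
    R6 = f.add(R3, R3)
    R2 = f.mul(R6, R3)
    R1 = f.add(R4, R1)
    R1 = f.add(R1, W1)
    R3 = f.sqr(R1)
    R4 = f.mul(R6, X1)
    R5 = f.add(W1, W1)
    R3 = f.sub(R3, R4)
    W2 = f.mul(R2, R5)
    X2 = f.sub(R3, R4)
    R6 = f.sub(R4, X2)
    R4 = f.mul(R6, R1)
    Y2 = f.sub(R4, R2)
    return X2, Y2, Z2, W2

def affine_double(p, a, x, y):
    """Reference doubling in affine coordinates (requires Python 3.8+ for pow)."""
    lam = (3 * x * x + a) * pow(2 * y, -1, p) % p
    x3 = (lam * lam - 2 * x) % p
    return x3, (lam * (x - x3) - y) % p

p, a = 10007, 5              # toy prime and curve coefficient (assumptions)
x, y, z = 1234, 4321, 77     # affine coordinates and an arbitrary Z value
# The doubling formulas never use b, so b is left implicit.
X1, Y1, Z1, W1 = x * z**2 % p, y * z**3 % p, z, a * z**4 % p

f = Field(p)
X2, Y2, Z2, W2 = double_modified_jacobian(f, X1, Y1, Z1, W1)
zi = pow(Z2, -1, p)
result = (X2 * zi**2 % p, Y2 * zi**3 % p)

print("".join(f.trace) == PATTERN_4)        # operations follow pattern (4)
print(result == affine_double(p, a, x, y))  # same point as affine doubling
print(W2 == a * Z2**4 % p)                  # W2 is consistent with Z2
```

All three checks are expected to print `True`: the doubling uses every row of the pattern, which is why no dummy operation is needed.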



4.3 Theoretical Gain 

In Table 4 we present the cost of right-to-left mixed scalar multiplication
protected with atomic pattern (4). The table also reports the gains obtained
over the left-to-right and right-to-left algorithms protected with atomic
patterns (2) and (1) respectively.



Nb. of extra   S/M   Right-to-left   Gain over            Gain over
points               with (4)        l.-to-r. with (2)    r.-to-l. with (1)
---------------------------------------------------------------------------
0              0.8   16.0 M           9.6 %               20.0 %
               1     16.7 M           5.6 %               16.5 %
1              0.8   14.4 M          10.6 %               20.0 %
               1     15.0 M           6.8 %               16.7 %
2              0.8   13.9 M          10.9 %               19.7 %
               1     14.4 M           7.7 %               16.8 %
3              0.8   13.4 M          11.3 %               20.2 %
               1     14.0 M           7.3 %               16.7 %
4              0.8   13.3 M          10.7 %               19.9 %
               1     13.8 M           7.4 %               16.9 %


Table 4. Cost estimation, in field multiplications per bit, of Alg. 2 protected
with improved pattern (4), and comparison with the two other methods presented
in Table 3, assuming A/M = 0.2.
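
Some entries of the "Right-to-left with (4)" column can be recomputed from the composition of pattern (4), namely 6 multiplications, 2 squarings and 10 additions or subtractions. The sketch below assumes that a subtraction costs the same as an addition, that one doubling is performed per scalar bit, and that a width-$w$ NAF has a nonzero digit density of $1/(w+1)$; these assumptions and the function names are not taken from this section. Only the rows with 0, 1 and 3 extra points, which correspond to $w = 2, 3, 4$ under this reading, are covered.

```python
def pattern4_cost(s_over_m, a_over_m=0.2):
    """Cost of one occurrence of pattern (4), in field multiplications."""
    return 6 + 2 * s_over_m + 10 * a_over_m

def cost_per_bit(window_width, s_over_m):
    """One doubling per bit plus one addition per nonzero digit."""
    dbl = pattern4_cost(s_over_m)        # a doubling is one pattern
    add = 2 * pattern4_cost(s_over_m)    # an addition is two patterns
    density = 1 / (window_width + 1)     # assumed density of nonzero digits
    return dbl + density * add

for w, extra_points in [(2, 0), (3, 1), (4, 3)]:
    print(extra_points,
          round(cost_per_bit(w, 0.8), 1),
          round(cost_per_bit(w, 1.0), 1))
```

Under these assumptions the values match the corresponding rows of Table 4 after rounding to one decimal.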
Thanks to our new atomic pattern (4), right-to-left mixed scalar multiplication
turns out to be the fastest of these solutions in every case. The average
speed-up over pattern (1) is 18.3 %, and the average gain over left-to-right
scalar multiplication protected with atomic pattern (2) is 10.6 % if dedicated
squaring is available, or 7.0 % otherwise.
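
The averages quoted above follow directly from the gain columns of Table 4, as the short check below shows. The variable names are illustrative.

```python
# Gains over left-to-right with (2), in percent, for 0 to 4 extra points.
gain_over_2 = {
    0.8: [9.6, 10.6, 10.9, 11.3, 10.7],  # S/M = 0.8 (dedicated squaring)
    1.0: [5.6, 6.8, 7.7, 7.3, 7.4],      # S/M = 1
}
# Gains over right-to-left with (1), in percent, all ten rows of the table.
gain_over_1 = [20.0, 16.5, 20.0, 16.7, 19.7, 16.8, 20.2, 16.7, 19.9, 16.9]

def mean(values):
    return sum(values) / len(values)

avg_sq = round(mean(gain_over_2[0.8]), 1)
avg_no_sq = round(mean(gain_over_2[1.0]), 1)
avg_pattern1 = round(mean(gain_over_1), 1)
print(avg_sq, avg_no_sq, avg_pattern1)  # 10.6 7.0 18.3
```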

4.4 Experimental Results 

We have implemented Alg. 2, without any window method, protected with
the atomic pattern (1) on the one hand and with our improved atomic pattern (4)
on the other hand. We used a chip equipped with an 8-bit CPU running at 30
MHz and with a 32-bit crypto-coprocessor running at 50 MHz. In particular, this
crypto-coprocessor provides a dedicated modular squaring. The characteristics
of the corresponding implementation are given in Table 5. On the NIST P-192
curve [17] we obtained a practical speed-up of about 14.5 %, to be compared
with the predicted 20 %. This difference can be explained by the extra software
processing required to manage the scalar multiplication loop, especially the
on-the-fly NAF decomposition of the scalar in an SSCA-resistant way.



Timing     RAM size   Code size
-------------------------------
29.6 ms    412 B      3.5 KB



Table 5. Characteristics of our implementation of the atomically protected 192-bit 
scalar multiplication on an 8-bit chip with a 32-bit crypto-coprocessor. 

When observing the side-channel leakage of our implementation we obtained 
the signal presented in Fig. 4. Atomic patterns comprising 8 modular multipli- 
cations and several additions/subtractions can easily be identified. 




Fig. 4. Side-channel leakage observed during the execution of our scalar multiplication 
implementation showing a sequence of atomic patterns. 
5 Conclusion



In this paper, we propose a new atomic pattern for scalar multiplication on
elliptic curves over $\mathbb{F}_p$ and detail our method for atomic pattern
improvement. To achieve this goal, two ways are explored. Firstly we maximize
the use of squarings to replace multiplications since the latter are slower.
Secondly we minimize the use of field additions and negations since they induce
a non-negligible penalty. In particular, we point out that the classical
hypothesis taken by scalar multiplication designers to neglect the cost of
additions/subtractions in $\mathbb{F}_p$ is not valid when focusing on embedded
devices such as smart cards.

In this context our method provides an average 18.3 % improvement for the 
right-to-left mixed scalar multiplication from [21] protected with the atomic 
pattern from [9]. It also provides an average 10.6 % gain over the fastest algorithm
identified before our contribution if dedicated squaring is available. Furthermore, 
though the topic of this paper is right-to-left scalar multiplication, our atomic 
pattern improvement method can be generically used to speed up atomically 
protected algorithms. 

In conclusion we recommend that algorithm designers targeting embedded
devices take into account the cost of additions and subtractions when
these operations are heavily used in an algorithm. Moreover, the issue of
designing efficient atomic patterns should be considered when proposing
non-regular sensitive algorithms.
A Atomic Formulas



A.1 Atomic General Doubling Using Pattern (2) or (3)

The decomposition of a general - i.e. for all a - doubling in Jacobian coordinates 
using atomic pattern (3) is depicted hereafter. The corresponding decomposition 
using atomic pattern (2) can straightforwardly be obtained by replacing every 
squaring by a multiplication using the same operand twice. 

The input point is given as $(X_1, Y_1, Z_1)$ and the result is written into the
point $(X_2, Y_2, Z_2)$. Four intermediate registers, $R_1$ to $R_4$, are used.



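\[
\begin{array}{c|l}
1 & R_1 \leftarrow X_1^2 \\
  & \star \\
  & R_3 \leftarrow R_1 + R_1 \\
  & R_2 \leftarrow Z_1 \cdot Z_1 \\
  & \star \\
  & R_1 \leftarrow R_1 + R_3 \\
  & R_4 \leftarrow X_1 + X_1 \\ \hline
2 & R_2 \leftarrow R_2^2 \\
  & \star \\
  & \star \\
  & R_2 \leftarrow a \cdot R_2 \\
  & \star \\
  & R_1 \leftarrow R_1 + R_2 \\
  & R_2 \leftarrow Y_1 + Y_1 \\ \hline
3 & R_3 \leftarrow Y_1^2 \\
  & \star \\
  & \star \\
  & Z_2 \leftarrow Z_1 \cdot R_2 \\
  & \star \\
  & R_3 \leftarrow R_3 + R_3 \\
  & \star \\ \hline
4 & R_2 \leftarrow R_1^2 \\
  & \star \\
  & \star \\
  & R_4 \leftarrow R_4 \cdot R_3 \\
  & R_4 \leftarrow -R_4 \\
  & R_2 \leftarrow R_2 + R_4 \\
  & X_2 \leftarrow R_2 + R_4 \\ \hline
5 & R_3 \leftarrow R_3^2 \\
  & R_1 \leftarrow -R_1 \\
  & R_4 \leftarrow X_2 + R_4 \\
  & R_1 \leftarrow R_1 \cdot R_4 \\
  & R_3 \leftarrow -R_3 \\
  & R_3 \leftarrow R_3 + R_3 \\
  & Y_2 \leftarrow R_3 + R_1
\end{array}
\]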
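
The remark that the pattern (2) decomposition follows from the pattern (3) one by turning each squaring into a multiplication can be illustrated with a small interpreter. The sketch below is a hypothetical encoding: it assumes that pattern (3) is the sequence squaring, negation, addition, multiplication, negation, addition, addition, that pattern (2) differs only in its first slot, and that the slot contents are one consistent reading of the decomposition above. Dummy slots are skipped because only values are checked, whereas a real implementation would still execute an operation there.

```python
S, M, N, A = "sqr", "mul", "neg", "add"
PATTERN_3 = [S, N, A, M, N, A, A]
PATTERN_2 = [M, N, A, M, N, A, A]  # squaring slot replaced by a multiplication

# One tuple (destination, operands...) per slot; None marks a dummy slot.
DOUBLING = [
    ("R1", "X1"), None, ("R3", "R1", "R1"), ("R2", "Z1", "Z1"),
    None, ("R1", "R1", "R3"), ("R4", "X1", "X1"),
    ("R2", "R2"), None, None, ("R2", "a", "R2"),
    None, ("R1", "R1", "R2"), ("R2", "Y1", "Y1"),
    ("R3", "Y1"), None, None, ("Z2", "Z1", "R2"),
    None, ("R3", "R3", "R3"), None,
    ("R2", "R1"), None, None, ("R4", "R4", "R3"),
    ("R4", "R4"), ("R2", "R2", "R4"), ("X2", "R2", "R4"),
    ("R3", "R3"), ("R1", "R1"), ("R4", "X2", "R4"), ("R1", "R1", "R4"),
    ("R3", "R3"), ("R3", "R3", "R3"), ("Y2", "R3", "R1"),
]

def run(slots, pattern, inputs, p):
    """Execute the slots, taking the kind of each one from the pattern."""
    regs = dict(inputs)
    for i, slot in enumerate(slots):
        if slot is None:
            continue  # dummy slot, skipped in this value-only check
        kind = pattern[i % len(pattern)]
        dst, *names = slot
        v = [regs[name] for name in names]
        if kind == N:
            regs[dst] = -v[0] % p
        elif kind == A:
            regs[dst] = (v[0] + v[1]) % p
        else:  # squaring, or multiplication (a lone operand is used twice)
            regs[dst] = v[0] * v[-1] % p
    return regs["X2"], regs["Y2"], regs["Z2"]

p, a = 10007, 5              # toy prime and curve coefficient (assumptions)
x, y, z = 1234, 4321, 77     # affine coordinates and an arbitrary Z value
inputs = {"X1": x * z**2 % p, "Y1": y * z**3 % p, "Z1": z, "a": a}

with_p3 = run(DOUBLING, PATTERN_3, inputs, p)
with_p2 = run(DOUBLING, PATTERN_2, inputs, p)

# Reference: affine doubling (pow with exponent -1 needs Python 3.8+).
lam = (3 * x * x + a) * pow(2 * y, -1, p) % p
x2 = (lam * lam - 2 * x) % p
y2 = (lam * (x - x2) - y) % p
zi = pow(with_p3[2], -1, p)
affine = (with_p3[0] * zi**2 % p, with_p3[1] * zi**3 % p)

print(with_p3 == with_p2, affine == (x2, y2))
```

Both comparisons are expected to print `True`, which shows that the same slot list serves for the two patterns.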
A.2 Atomic Readdition Using Pattern (2)



The decomposition of a readdition in Jacobian coordinates using atomic pattern 
(2) is depicted hereafter. 

The input points are given as $(X_1, Y_1, Z_1)$, $(X_2, Y_2, Z_2)$ and the result is
written into the point $(X_3, Y_3, Z_3)$. Seven intermediate registers, $R_1$ to
$R_7$, are used.



[Listing of the atomic readdition: field operations arranged according to pattern (2), with * marking dummy operations.]
A.3 Atomic Modified Jacobian Coordinates Doubling Using
Pattern (1)



The decomposition of a doubling in modified Jacobian coordinates using atomic 
pattern (1) is depicted hereafter. 

The input point is given as $(X_1, Y_1, Z_1)$ and the result is written into the
point $(X_2, Y_2, Z_2)$. Six intermediate registers, $R_1$ to $R_6$, are used.



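\[
\begin{array}{c|l}
1 & R_1 \leftarrow X_1 \cdot X_1 \\
  & R_2 \leftarrow Y_1 + Y_1 \\
  & \star \\
  & \star \\ \hline
2 & Z_2 \leftarrow R_2 \cdot Z_1 \\
  & R_4 \leftarrow R_1 + R_1 \\
  & \star \\
  & \star \\ \hline
3 & R_3 \leftarrow R_2 \cdot Y_1 \\
  & R_6 \leftarrow R_3 + R_3 \\
  & \star \\
  & \star \\ \hline
4 & R_2 \leftarrow R_6 \cdot R_3 \\
  & R_1 \leftarrow R_4 + R_1 \\
  & \star \\
  & R_1 \leftarrow R_1 + W_1 \\ \hline
5 & R_3 \leftarrow R_1 \cdot R_1 \\
  & \star \\
  & \star \\
  & \star \\ \hline
6 & R_4 \leftarrow R_6 \cdot X_1 \\
  & R_5 \leftarrow W_1 + W_1 \\
  & R_4 \leftarrow -R_4 \\
  & R_3 \leftarrow R_3 + R_4 \\ \hline
7 & W_2 \leftarrow R_2 \cdot R_5 \\
  & X_2 \leftarrow R_3 + R_4 \\
  & R_2 \leftarrow -R_2 \\
  & R_6 \leftarrow R_4 + X_2 \\ \hline
8 & R_4 \leftarrow R_6 \cdot R_1 \\
  & \star \\
  & R_4 \leftarrow -R_4 \\
  & Y_2 \leftarrow R_4 + R_2
\end{array}
\]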






